optimize u128/i128 popcounts further #354
Conversation
Thanks! Do you think it would be worth trying to make it work even for non-native integers by using the operations from the […]

I'll give it a try - if it ends up working, I'll replace what I've posted here with whatever works. If not, I'll let you know what fails.
I think I hit a bug in libgccjit. When I compile with this patch, I get the following error:

[error output collapsed]
Indeed. Good find!

Hmm, I wonder if replacing the assignment ops with just plain assignments would work. I'll give it a try.

I added the type check in this commit. Yeah, that might work indeed.
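For context, here is a minimal standalone sketch of the workaround under discussion, assuming the `gccjit` crate's block API (`add_assignment_op` for compound assignments, `add_assignment` plus `new_binary_op` for the plain form). The function name `demo` and output file are made up for illustration:

```rust
use gccjit::{BinaryOp, Context, FunctionType, OutputKind, ToRValue};

fn main() {
    let context = Context::default();
    let int_ty = context.new_type::<i64>();
    let x = context.new_parameter(None, int_ty, "x");
    let func = context.new_function(None, FunctionType::Exported, int_ty, &[x], "demo", false);
    let block = func.new_block("entry");
    let acc = func.new_local(None, int_ty, "acc");
    block.add_assignment(None, acc, x.to_rvalue());

    // Compound assignment (`acc += x`) -- the form that tripped the
    // libgccjit bug on 128-bit operands:
    // block.add_assignment_op(None, acc, BinaryOp::Plus, x.to_rvalue());

    // Plain-assignment equivalent (`acc = acc + x`):
    let sum = context.new_binary_op(None, BinaryOp::Plus, int_ty, acc.to_rvalue(), x.to_rvalue());
    block.add_assignment(None, acc, sum);

    block.end_with_return(None, acc.to_rvalue());
    context.compile_to_file(OutputKind::ObjectFile, "demo.o");
}
```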
Sadly, while swapping the assignment ops out for plain assignments works around the bug, it also results in poor codegen when 128-bit integers aren't natively supported and the fallback is skipped.

Before:

```asm
popcount_128:
        xor     eax, eax
        popcnt  rax, rsi
        test    rsi, rsi
        cmove   rax, rsi
        xor     ecx, ecx
        popcnt  rcx, rdi
        mov     rdx, rax
        add     rax, rcx
        test    rdi, rdi
        cmove   rax, rdx
        ret
```

After:

```asm
popcount_128:
        push    r14
        mov     rax, rsi
        push    r13
        push    r12
        push    rbp
        push    rbx
        sub     rsp, 32
        or      rax, rdi
        mov     QWORD PTR 16[rsp], rdi
        mov     QWORD PTR 24[rsp], rsi
        vmovdqa xmm3, XMMWORD PTR 16[rsp]
        vmovdqa XMMWORD PTR [rsp], xmm3
        je      .L67
        mov     rbp, rsi
        mov     rbx, rdi
        xor     r12d, r12d
.L66:
        mov     r14, QWORD PTR [rsp]
        mov     r13, QWORD PTR 8[rsp]
        xor     ecx, ecx
        mov     edx, 1
        add     r12d, 1
        mov     rdi, r14
        mov     rsi, r13
        call    __rust_u128_sub@PLT
        xor     ecx, ecx
        mov     edx, 1
        mov     rdi, r14
        mov     rsi, r13
        and     rbx, rax
        mov     QWORD PTR 16[rsp], rax
        call    __rust_u128_sub@PLT
        vmovq   xmm0, QWORD PTR 16[rsp]
        mov     rax, rbx
        and     rbp, rdx
        vpinsrq xmm0, xmm0, rdx, 1
        vpand   xmm1, xmm0, XMMWORD PTR [rsp]
        or      rax, rbp
        vmovdqa XMMWORD PTR [rsp], xmm1
        jne     .L66
        mov     eax, r12d
.L63:
        add     rsp, 32
        pop     rbx
        pop     rbp
        pop     r12
        pop     r13
        pop     r14
        ret
.L67:
        xor     eax, eax
        jmp     .L63
```
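Reading the "After" listing, the generated loop appears to be the classic bit-clearing popcount: each iteration computes `x - 1` (through the `__rust_u128_sub` libcall, since 128-bit subtraction isn't native here) and ANDs it into `x`, clearing the lowest set bit until `x` reaches zero. A Rust sketch of the equivalent logic:

```rust
// Bit-clearing (Kernighan-style) popcount: one iteration per set bit.
fn popcount_bit_clearing(mut x: u128) -> u32 {
    let mut count = 0;
    while x != 0 {
        x &= x - 1; // clear the lowest set bit
        count += 1;
    }
    count
}

fn main() {
    assert_eq!(popcount_bit_clearing(0b1011), 3);
    assert_eq!(popcount_bit_clearing(u128::MAX), 128);
}
```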
However, when 128-bit integers are supported, we do see the improvement as expected.

Before:

```asm
popcount_128:
        xor     eax, eax
        popcnt  rax, rsi
        test    rsi, rsi
        cmove   rax, rsi
        xor     ecx, ecx
        popcnt  rcx, rdi
        mov     rdx, rax
        add     rax, rcx
        test    rdi, rdi
        cmove   rax, rdx
        ret
```

After:

```asm
popcount_128:
        mov     rdx, rdi
        or      rdx, rsi
        je      .L65
        xor     ecx, ecx
        popcnt  rsi, rsi
        popcnt  rcx, rdi
        lea     eax, [rsi+rcx]
        ret
.L65:
        xor     eax, eax
        ret
```

I think if we gate the fallback behind native support for 128-bit integers, we'll get the best of both worlds: good-ish codegen when 128-bit integers are supported, and a roughly-as-good fallback when they aren't.
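To make the proposed gating concrete, here is a rough sketch. The shape is illustrative only: `native_u128` stands in for the codegen's actual "does the target support 128-bit integers natively?" check, which is not spelled this way in the code.

```rust
// Hypothetical sketch of the gating described above.
fn popcount_u128(value: u128, native_u128: bool) -> u32 {
    if native_u128 {
        // Native path: let the backend emit popcount on the full value.
        value.count_ones()
    } else {
        // Fallback: break the value apart, popcount each 64-bit half,
        // and add the results (the `popcnt rsi`/`popcnt rdi` pair above).
        let lo = value as u64;
        let hi = (value >> 64) as u64;
        lo.count_ones() + hi.count_ones()
    }
}

fn main() {
    assert_eq!(popcount_u128(u128::MAX, true), 128);
    assert_eq!(popcount_u128(u128::MAX, false), 128);
}
```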
Force-pushed from 3d97550 to 77c2ccc.
Do you know (or did you do some benchmarks to check) if having one jump is indeed faster than the two conditional moves?

I wrote some microbenchmarks with […]
Alright, I think I have some solid numbers (at least for my machine). My benchmarks run popcount 1000 times on the same number and add the results together. This is done to inflate iteration times a bit; with an operation as small as this, it's hard to get good numbers for how long it takes to complete.

Before:

[benchmark output collapsed]

After:

[benchmark output collapsed]
Source:

```rust
#![feature(test)]
extern crate test; // required to use the unstable `test` crate for benchmarks

pub fn popcount_128(a: u128) -> u32 {
    a.count_ones()
}

#[cfg(test)]
pub mod tests {
    use super::*;
    use test::{Bencher, black_box};

    #[bench]
    fn bench_popcount_u128_0(b: &mut Bencher) {
        let x = black_box(0u128);
        b.iter(|| (0..1000).fold(0, |old, _new| old + popcount_128(black_box(x))))
    }

    #[bench]
    fn bench_popcount_u128_1(b: &mut Bencher) {
        let x = black_box(1u128);
        b.iter(|| (0..1000).fold(0, |old, _new| old + popcount_128(black_box(x))))
    }

    #[bench]
    fn bench_popcount_u128_10(b: &mut Bencher) {
        let x = black_box(10u128);
        b.iter(|| (0..1000).fold(0, |old, _new| old + popcount_128(black_box(x))))
    }

    #[bench]
    fn bench_popcount_u128_minus_1(b: &mut Bencher) {
        let x = black_box(-1i128 as u128);
        b.iter(|| (0..1000).fold(0, |old, _new| old + popcount_128(black_box(x))))
    }
}
```

Not sure if other processor types will see similar results - I'm on an R7 5800X, but you might see different results on Intel processors or older AMD processors. I think a lot of the branch/conditional moves will go away once #351 gets resolved, so this might not even matter in the long term.
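(For reference, benchmarks written against the unstable `test` crate like these require a nightly toolchain and are run with `cargo bench`.)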
Don't fall back on breaking apart the popcount operation if 128-bit integers are natively supported.

Signed-off-by: Andy Sadler <[email protected]>
Force-pushed from 77c2ccc to 81c1f39.
Thanks a lot for your work!